Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter
Typical text spotters follow the two-stage spotting strategy: detect the
precise boundary for a text instance first and then perform text recognition
within the located text region. While such a strategy has achieved substantial
progress, it has two underlying limitations. 1) The performance of text
recognition depends heavily on the precision of text detection, so errors can
propagate from detection to recognition. 2) The RoI cropping that bridges
detection and recognition introduces background noise and causes information
loss when pooling or interpolating from
feature maps. In this work we propose the single shot Self-Reliant Scene Text
Spotter (SRSTS), which circumvents these limitations by decoupling recognition
from detection. Specifically, we conduct text detection and recognition in
parallel and bridge them via a shared positive anchor point. Consequently, our
method can recognize text instances correctly even when their precise
boundaries are hard to detect. Additionally, our method
reduces the annotation cost for text detection substantially. Extensive
experiments on both regular-shaped and arbitrary-shaped benchmarks
demonstrate that our SRSTS compares favorably to previous state-of-the-art
spotters in terms of both accuracy and efficiency.
Comment: To appear in the Proceedings of the ACM International Conference on
Multimedia (ACM MM), 202
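To make the decoupling concrete, here is a minimal sketch (not the authors'
code) of two parallel heads tied only by shared anchor points: both the
detection and the recognition branch are read out at the same anchor
locations, so recognition never consumes a detected boundary. The head
shapes, channel counts, and sampling scheme are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DecoupledHeads(nn.Module):
        # Hypothetical module: detection and recognition run in parallel over
        # shared backbone features and are bridged only by anchor points.
        def __init__(self, in_ch=256, num_classes=97):
            super().__init__()
            self.det_head = nn.Conv2d(in_ch, 8, 3, padding=1)             # boundary offsets
            self.rec_head = nn.Conv2d(in_ch, num_classes, 3, padding=1)   # char logits

        def forward(self, feats, anchors):
            # feats: (B, C, H, W); anchors: (B, N, 2) integer (y, x) positions
            det_map = self.det_head(feats)                 # (B, 8, H, W)
            rec_map = self.rec_head(feats)                 # (B, num_classes, H, W)
            b = torch.arange(feats.size(0)).unsqueeze(1)   # (B, 1), broadcasts to (B, N)
            ys, xs = anchors[..., 0], anchors[..., 1]
            # Both branches are sampled at the *same* anchors, never at RoIs,
            # so an imprecise boundary cannot corrupt the recognized text.
            return det_map[b, :, ys, xs], rec_map[b, :, ys, xs]   # (B, N, 8), (B, N, K)

    feats = torch.randn(2, 256, 32, 32)
    anchors = torch.randint(0, 32, (2, 5, 2))
    det, rec = DecoupledHeads()(feats, anchors)
    print(det.shape, rec.shape)   # torch.Size([2, 5, 8]) torch.Size([2, 5, 97])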
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning
Owing to their flexible representation of arbitrary-shaped scene text and
their simple pipeline, bottom-up segmentation-based methods have become
mainstream in real-time scene text detection. Despite great progress, these
methods show
deficiencies in robustness and still suffer from false positives and instance
adhesion. Unlike existing methods that integrate multi-granularity features or
multiple outputs, we take a representation-learning perspective, using
auxiliary tasks to help the encoder learn robust features jointly with the
main task of per-pixel classification during optimization. For semantic
representation learning, we propose global-dense
semantic contrast (GDSC), in which a vector is extracted for global semantic
representation, which is then used to perform element-wise contrast with the dense grid
features. To learn instance-aware representation, we propose to combine
top-down modeling (TDM) with the bottom-up framework to provide implicit
instance-level clues for the encoder. With the proposed GDSC and TDM, the
encoder network learns stronger representations without introducing any extra
parameters or computation at inference time. Equipped with a very light
decoder, the detector achieves more robust real-time scene text detection.
Experimental results on four public datasets show that the proposed method can
match or outperform the state of the art in both accuracy and speed.
Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on
Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce
RTX 2080 Ti GPU.
Comment: Accepted by ACM MM 202
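As a rough illustration of GDSC, the sketch below pools a global semantic
vector over annotated text pixels and contrasts it element-wise with every
grid feature via cosine similarity; the masked pooling and the BCE-style
objective are assumptions on our part, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def gdsc_loss(feats, text_mask, eps=1e-6):
        # feats:     (B, C, H, W) dense grid features from the encoder
        # text_mask: (B, 1, H, W) binary ground-truth text mask (float)
        # Global semantic vector: masked average of features over text pixels.
        g = (feats * text_mask).sum(dim=(2, 3)) / (text_mask.sum(dim=(2, 3)) + eps)
        # Element-wise contrast: cosine similarity of g with each grid feature.
        sim = F.cosine_similarity(feats, g[:, :, None, None], dim=1)   # (B, H, W)
        prob = ((sim.clamp(-1.0, 1.0) + 1.0) / 2.0).clamp(eps, 1 - eps)
        # Pull text pixels toward g, push background away (assumed objective).
        return F.binary_cross_entropy(prob, text_mask.squeeze(1))

    loss = gdsc_loss(torch.randn(2, 256, 40, 40),
                     torch.randint(0, 2, (2, 1, 40, 40)).float())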
GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction
All tables can be represented as grids. Based on this observation, we propose
GridFormer, a novel approach for interpreting unconstrained table structures by
predicting the vertex and edge of a grid. First, we propose a flexible table
representation in the form of an M×N grid. In this representation, the vertices
and edges of the grid store the localization and adjacency information of the
table. Then, we introduce a DETR-style table structure recognizer to
efficiently predict this multi-objective information of the grid in a single
shot. Specifically, given a set of learned row and column queries, the
recognizer directly outputs the vertex and edge information of the
corresponding rows and columns. Extensive experiments on five challenging
benchmarks which include wired, wireless, multi-merge-cell, oriented, and
distorted tables demonstrate the competitive performance of our model over
other methods.
Comment: ACM MM 202
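To illustrate the grid representation, the sketch below encodes a table as an
M×N lattice of vertices plus horizontal and vertical edge flags, expresses a
merged cell by deleting the shared edge, and shows how learned row and column
queries could be combined into one embedding per vertex; the names and shapes
here are our assumptions, not GridFormer's actual interface.

    import torch
    import torch.nn as nn

    M, N, D = 4, 5, 256                  # an (M-1) x (N-1) = 3 x 4 table of cells
    vertices = torch.zeros(M, N, 2)      # (x, y) location of each grid vertex
    h_edges = torch.ones(M, N - 1)       # edge between vertex (i, j) and (i, j+1)
    v_edges = torch.ones(M - 1, N)       # edge between vertex (i, j) and (i+1, j)
    # Merging two horizontally adjacent cells deletes the vertical edge they
    # share, e.g. the segment between vertices (1, 2) and (2, 2):
    v_edges[1, 2] = 0.0

    # DETR-style single-shot readout: combine learned row and column queries
    # into one embedding per vertex and regress its location with a small head.
    row_q = nn.Parameter(torch.randn(M, D))              # learned row queries
    col_q = nn.Parameter(torch.randn(N, D))              # learned column queries
    vertex_emb = row_q[:, None, :] + col_q[None, :, :]   # (M, N, D)
    pred_vertices = nn.Linear(D, 2)(vertex_emb)          # (M, N, 2) locations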
PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network
The reading of arbitrarily-shaped text has received increasing research
attention. However, existing text spotters are mostly built on two-stage
frameworks or character-based methods, which suffer from either Non-Maximum
Suppression (NMS), Region-of-Interest (RoI) operations, or character-level
annotations. In this paper, to address the above problems, we propose a novel
fully convolutional Point Gathering Network (PGNet) for reading
arbitrarily-shaped text in real time. PGNet is a single-shot text spotter in
which the pixel-level character classification map is learned with the
proposed PG-CTC loss, avoiding the use of character-level annotations. With a
PG-CTC
decoder, we gather high-level character classification vectors from
two-dimensional space and decode them into text symbols without NMS and RoI
operations involved, which guarantees high efficiency. Additionally, a graph
refinement module (GRM) is proposed that reasons about the relations between
each character and its neighbors to refine the coarse recognition and improve
end-to-end performance. Experiments show that the proposed method achieves
competitive accuracy while significantly improving the running speed. In
particular, on Total-Text it runs at 46.7 FPS, surpassing previous spotters by
a large margin.
Comment: 10 pages, 8 figures, AAAI 202
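The point-gathering idea can be sketched as follows: gather per-pixel
character logits at ordered center-line points and collapse the resulting
sequence CTC-style, merging repeats and dropping blanks, with no NMS or RoI
operations. The greedy collapse below is our simplification, not PGNet's
exact PG-CTC decoder.

    import torch

    def pg_ctc_greedy_decode(char_map, center_points, blank=0):
        # char_map:      (K, H, W) pixel-level character classification map
        # center_points: (T, 2) ordered (y, x) points along one text center line
        ys, xs = center_points[:, 0], center_points[:, 1]
        seq = char_map[:, ys, xs].argmax(dim=0)   # (T,) best class at each point
        decoded, prev = [], blank
        for c in seq.tolist():
            if c != blank and c != prev:          # CTC collapse: merge repeats,
                decoded.append(c)                 # then drop blank symbols
            prev = c
        return decoded

    char_map = torch.randn(37, 64, 256)           # 36 symbols + 1 blank class
    ys = torch.randint(0, 64, (20,))
    xs = torch.arange(0, 200, 10)                 # 20 points, left to right
    print(pg_ctc_greedy_decode(char_map, torch.stack([ys, xs], dim=1)))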
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Text images contain both visual and linguistic information. However, existing
pre-training techniques for text recognition mainly focus on either visual
representation learning or linguistic knowledge learning. In this paper, we
propose a novel approach MaskOCR to unify vision and language pre-training in
the classical encoder-decoder recognition framework. We adopt the masked image
modeling approach to pre-train the feature encoder using a large set of
unlabeled real text images, which allows us to learn strong visual
representations. Rather than introducing linguistic knowledge through an
additional language model, we directly pre-train the sequence decoder.
Specifically, we transform text data into synthesized text images to unify the
data modalities of vision and language, and enhance the language modeling
capability of the sequence decoder using a proposed masked image-language
modeling scheme. Notably, the encoder is kept frozen while the sequence
decoder is pre-trained. Experimental results demonstrate that our proposed
method achieves superior performance on benchmark datasets covering both
Chinese and English text images.
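A minimal sketch of the masked-image-modeling side of this pre-training,
assuming a simple MAE-style random patch mask; the patch size, mask ratio,
and reconstruction target are our assumptions rather than MaskOCR's exact
recipe.

    import torch

    def random_patch_mask(img, patch=8, mask_ratio=0.6):
        # img: (B, C, H, W) text image; H and W must be divisible by `patch`
        B, C, H, W = img.shape
        gh, gw = H // patch, W // patch
        n = gh * gw
        keep = int(n * (1 - mask_ratio))
        ids = torch.rand(B, n).argsort(dim=1)        # random patch order per image
        mask = torch.ones(B, n, dtype=torch.bool)    # True = masked
        mask.scatter_(1, ids[:, :keep], False)       # first `keep` patches stay visible
        # Zero out masked patches; an encoder would see the visible ones and a
        # light decoder would reconstruct pixels at the masked positions.
        patches = img.reshape(B, C, gh, patch, gw, patch)
        m = mask.reshape(B, 1, gh, 1, gw, 1)
        return patches.masked_fill(m, 0.0).reshape(B, C, H, W), mask

    masked, mask = random_patch_mask(torch.randn(2, 3, 32, 128))
    print(masked.shape, mask.float().mean().item())  # masked ratio is ~0.6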